You are viewing the RapidMiner Studio documentation for version 10.1.

Get Pages (Web Mining)

Synopsis

Gets pages from URLs in an attribute and stores them into a new attribute.

Description

This operator retrieves the pages whose URLs are contained in the input data set. For each row in the data set, the URL is extracted from the attribute specified by the parameter link attribute, a GET request is sent, and the page is acquired. The page is stored in a new attribute specified by the parameter page attribute.
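The per-row behavior described above can be sketched in Python. This is only an illustration of the idea, not the operator's implementation: the function names fetch_page and get_pages, the default user agent string, and the representation of the data set as a list of dictionaries are all assumptions.

```python
from urllib.request import Request, urlopen


def fetch_page(url, user_agent="example-agent", timeout_s=10.0):
    """Send a GET request to url and return the response body as text."""
    # Hypothetical helper: the header value and timeout are illustrative defaults.
    request = Request(url, headers={"User-Agent": user_agent}, method="GET")
    with urlopen(request, timeout=timeout_s) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def get_pages(rows, link_attribute, page_attribute, fetch=fetch_page):
    """Return copies of rows with the fetched page stored under page_attribute."""
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input row unchanged
        new_row[page_attribute] = fetch(row[link_attribute])
        result.append(new_row)
    return result
```

Calling get_pages(rows, "url", "page") would add a "page" entry to every row that has a "url" entry. The fetch argument can be replaced with a stub when no network access is wanted.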

Input

  • Example Set (Data Table)

    The Example Set port.

Output

  • Example Set (Data Table)

    The Example Set port.

Parameters

  • link_attribute: The attribute that contains the URLs.
  • page_attribute: The name of the attribute that should contain the pages.
  • random_user_agent: Chooses a user agent at random from a set of 7000 user agents.
  • user_agent: The user agent property.
  • connection_timeout: The timeout (in ms) for the connection.
  • read_timeout: The timeout (in ms) for reading from the URL.
  • follow_redirects: Specifies whether redirects should be followed.
  • accept_cookies: Specifies whether cookies should be accepted.
  • cookie_scope: Specifies the scope of the cookies used.
  • request_method: Specifies the request method.
  • delay: Specifies whether execution is not delayed, delayed by a fixed amount of time, or delayed by a random amount of time.
  • delay_amount: The delay amount in ms.
  • min_delay_amount: The minimum delay amount in ms.
  • max_delay_amount: The maximum delay amount in ms.
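
The interplay of the delay parameters can be illustrated with a small Python sketch. The mode names "none", "fixed" and "random", the function name delay_ms, and the default values are assumptions made for illustration; this page does not list the values the operator accepts, and the operator may pick its random delay differently.

```python
import random


def delay_ms(mode, delay_amount=1000, min_delay_amount=0, max_delay_amount=1000):
    """Return the delay in milliseconds to wait before the next request."""
    if mode == "none":
        return 0
    if mode == "fixed":
        return delay_amount
    if mode == "random":
        # Assumed here: uniform choice between the two bounds, inclusive.
        return random.randint(min_delay_amount, max_delay_amount)
    raise ValueError("unknown delay mode: " + repr(mode))
```

Because the amounts are given in milliseconds, a caller using time.sleep would divide the result by 1000 first, for example time.sleep(delay_ms("fixed", delay_amount=250) / 1000).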